skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Garg, Kartikay"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Modern deep learning systems rely on (a) a handtuned neural network topology, (b) massive amounts of labelled training data, and (c) extensive training over large-scale compute resources to build a system that can perform efficient image classification or speech recognition. Unfortunately, we are still far away from implementing adaptive general purpose intelligent systems which would need to learn autonomously in unknown environments and may not have access to some or any of these three components. Reinforcement learning and evolutionary algorithm (EA) based methods circumvent this problem by continuously interacting with the environment and updating the models based on obtained rewards. However, deploying these algorithms on ubiquitous autonomous agents at the edge (robots/drones) demands extremely high energy-efficiency due to (i) tight power and energy budgets, (ii) continuous / lifelong interaction with the environment, (iii) intermittent or no connectivity to the cloud to offloadheavy-weight processing. To address this need, we present GENESYS, a HW-SW prototype of a EA-based learning system, that comprises of a closed loop learning engine called EvE and an inference engine called ADAM. EvE can evolve the topology and weights of neural networks completely in hardware for the task at hand, without requiring hand-optimization or backpropogation training. ADAM continuously interacts with the environment and is optimized for efficiently running the irregular neural networks generated by EvE. GENESYS identifies and leverages multiple unique avenues of parallelism unique to EAs that we term “gene”- level parallelism, and “population”-level parallelism. We ran GENESYS with a suite of environments from OpenAI gym and observed 2-5 orders of magnitude higher energy-efficiency over state-of-the-art embedded and desktop CPU and GPU systems. 
    more » « less
  2. This work is focused on analyzing potential performance improvements of HPC applications using stacked memories like the Hybrid Memory Cube, or HMC. We target a HPC sparse direct solver library, SuperLU [4], that performs LU decomposition and is a core piece of simulation codes like NIMROD [1]. To accelerate this library, we are interested in mapping both the computationally intense Spare Matrix-Vector (SpMV) kernels that can be implemented using matrix-matrix multiply (GEMM) calls and memory-intensive primitives like Scatter and Gather to a reconfigurable fabric tightly integrated with a 3D stacked memory. Here we provide initial results on mapping GEMM to OpenCL-based devices as well as a trace-driven evaluation of SuperLU's memory accesses with a combined FPGA and HMC platform. 
    more » « less
  3. Three-dimensional (3D)-stacked memories, such as the Hybrid Memory Cube (HMC), provide a promising solution for overcoming the bandwidth wall between processors and memory by integrating memory and logic dies in a single stack. Such memories also utilize a network-on-chip (NoC) to connect their internal structural elements and to enable scalability. This novel usage of NoCs enables numerous benefits such as high bandwidth and memory-level parallelism and creates future possibilities for efficient processing-in-memory techniques. However, the implications of such NoC integration on the performance characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap (i) by characterizing an HMC prototype using Micron's AC-510 accelerator board and by revealing its access latency and bandwidth behaviors; and (ii) by investigating the implications of such behaviors on system- and software-level designs. Compared to traditional DDR-based memories, our examinations reveal the performance impacts of NoCs for current and future 3D-stacked memories and demonstrate how the packet-based protocol, internal queuing characteristics, traffic conditions, and other unique features of the HMC affects the performance of applications. 
    more » « less